ABSTRACT

Scientific  method  is  a  process  of  guided  learning  in  which  accelerated 
acquisition  of  knowledge  relevant  to  some  question  under  investigation  is  achieved 
by  a  hierarchy  of  iterations  in  which  induction  and  deduction  are  used  in 
alternation. 

This  process  employs  a  developing  model  (or  series  of  models  implicit  or 
explicit)  against  which  data  can  be  viewed.  Ideally  at  any  given  stage  of  an 
investigation,  such  a  model  approximates  relevant  aspects  of  the  studied  system 
and  motivates  the  acquisition  of  further  data  as  well  as  its  analysis.  By  the 
use  of  a  prior  distribution  it  is  possible  to  represent  some  aspects  of  such  a 
model  as  completely  known  and  others  as  more  or  less  unknown. 

Now  parsimony  requires  that,  at  any  given  stage,  the  model  is  no  more  complex 
than  is  necessary  to  achieve  a  desirable  degree  of  approximation  and  since  each 
investigation  is  unique  we  cannot  be  sure  in  advance  that  any  model  we  postulate 
will  meet  this  goal.  Therefore,  at  the  various  points  in  our  investigation  where 
data  analysis  is  required,  two  types  of  inference  are  involved:  model  criticism 
and  parameter  estimation.  To  effect  the  latter,  conditional  on  the  plausibility 
of  the  model,  and  given  the  data,  we  can,  using  Bayes'  Theorem,  deduce  posterior 
distributions  for  unknown  parameters  and  so  make  inferences  about  them.  But, 
before  we  can  rely  on  such  conditional  deduction,  we  ought  logically  to  check 
whether  the  model  postulated  accords  with  the  data  at  all  and,  if  not,  consider 
how  it  should  be  modified.  In  practice,  this  question  is  usually  investigated  by 
inspecting  residuals,  by  other  informal  techniques,  and  sometimes  by  making  formal 
tests  of  goodness  of  fit.  In  any  case  model  criticism,  the  inferential  procedure 
whereby  the  need  for  model  modification  is  induced,  is  ultimately  dependent  on 
sampling  theory  argument.  These  principles  are  formalized  by  an  appropriate 
analysis  of  Bayes'  formula,  and  implications  for  robust  estimation  are  considered. 

SIGNIFICANCE AND EXPLANATION


Sampling theory inference (e.g. inference based on sampling distributions
of statistics and in particular on significance tests) and Bayesian
inference are usually thought of as rivals and much effort has been spent
in propounding their relative merits. In this paper it is argued that both
kinds of inference are needed in the scientific iteration whereby
knowledge is acquired.

This iteration employs a directed alternation between induction and
deduction which uses model criticism on the one hand and parameter
estimation on the other. An analysis of Bayes' formula reveals model
criticism as a sampling theory concept and parameter estimation as a
Bayesian concept. The implications of these ideas for robust estimation
are discussed.


The  responsibility  for  the  wording  and  views  expressed  in  this  descriptive 
summary  lies  with  MRC,  and  not  with  the  author  of  this  report. 


SAMPLING  AND  BAYES'  INFERENCE 


IN  THE  ADVANCEMENT  OF  LEARNING 
George  E.  P.  Box 

Today  Statistics  appears  to  be  in  a  somewhat  confused  state*.  The  controversy 
about  Bayesian  inference  and  Sampling  Theory  inference  which  some  believe  involves  a 
critical  choice  is  not  resolved  to  most  people's  satisfaction.  Furthermore  concepts 
such  as  Data  Analysis  and  Robust  Estimation  are  receiving  such  new  emphasis  that  some 
advocates  of  the  "new  Statistics"  are  even  claiming  that  all  else  is  useless  and  old 
hat. 

To some extent the new and admirable emphasis on "looking at the data" is a
reaction to previous extremes: on the one hand overemphasis on theory for its own
sake (mathematistry†) and on the other a knee-jerk approach to statistical analysis
(cookbookery†). Neither of these aberrations was healthy and some adjustment was long
overdue. However I think the mistake continues to be made of assuming that different
approaches to Statistics are necessarily in an adversary position. I will develop the
contrary view and try to show how I believe the pieces fit together.

I start from the idea that Statistics is or should be the art and science of
building scientific models which (necessarily) involve probability. Consider then how
such stochastic model building should be done.


*What  is  happening  is  related  to  the  revolutionary  change  in  computational  speed.  We 
need  to  be  deterred  less  and  less  by  the  number  of  steps  required  in  a  calculation 
although  we  must  correspondingly  increase  our  concern  that  the  human  mind  is  also 
adequately  involved  in  directing  the  tactics  and  strategy  of  investigation. 

†See the discussion of "mathematistry" and "cookbookery" in Science and Statistics
(Box 1976).


Sponsored  by  the  United  States  Army  under  Contract  No.  DAAG29-75-C-0024. 


1. The advancement of learning as an iteration between theory and practice

Although the matter was debated over the centuries, it seems long ago to have been agreed
that scientific knowledge is efficiently advanced, not by mere theoretical speculation on the
one hand, nor by the mere accumulation of empirical facts on the other, but by a motivated
iteration between these two activities. In this practice-theory iteration a tentative theory
or model suggests a particular examination and analysis of data already existing or to be
acquired*. The results of this examination will then frequently suggest a modified model
requiring further practical illumination and so on. The advancement of knowledge thus occurs
as the result of an interplay between dual processes of induction and deduction which carry
forward an iteration in which the model is not fixed but is continually changing. The
statistician's role is to assist this process. In doing so he uses two inferential devices that
I will call Criticism† and Estimation. The first can induce model modification; the second
leads to estimation of unknown parameters assuming the truth of the model. For illustration,
in Figure 1, at some stage of an investigation, model $M_i$ is currently being entertained.

Criticism involves a confrontation of $M_i$ with available data $y$ and asks whether
$M_i$ is consonant with $y$ and, if not, how not. It is a process of diagnostic checking. It
may be done informally, using plotting techniques of various kinds often involving residual
quantities, and more formally, with tests of goodness of fit. It may suggest that model
modification to $M_{i+1}$ is needed. In some instances it will be judged appropriate to now
confront $M_{i+1}$ with the same data; in others the nature of the modified model or the
necessity for independent verification may indicate the need for new data generated by a new
design. This will be chosen to explore shadowy regions whose illumination is currently
believed to be important to progress.

Estimation. If the process outlined above leads to a verifiable model, that is one
which when put to the test appears to provide an adequate approximation to reality, it may
logically be used to estimate parameters conditional on its truth. However in practice this
estimation process will be used not only at the termination of the model building sequence
but at every stage throughout it. This is because in order to conduct criticism of the model
it is often necessary to provisionally estimate parameters at intermediate stages, tentatively
entertaining the model as if it were believed true.

*I shall suppose that data is acquired from a designed experiment but the same argument would
apply if data acquisition was from a sample survey or even from a visit to the library.

†The apt naming of model criticism is due to Cuthbert Daniel.

I shall argue in this paper that, while criticism must ultimately appeal to sampling
theory for its justification, estimation requires the use of Bayes theorem (or, for the
fainthearted, likelihood). Acceptance of this position provides justification for a specific
kind of sampling theory significance test but none for sampling theory confidence intervals.


2.  Rival  theories  of  inference 


The distinction between inferential criticism and parameter estimation has often not
been made and proponents both of sampling inference and Bayesian inference* have long sought,
mistakenly in my view, for a single comprehensive theory. By sampling theory inference I mean
inference made by referring some relevant function of the data to a reference sampling
distribution which would be appropriate if some specific hypothetical model were true.

By  Bayesian  inference  I  mean  inference  made  by  calculation  of  a  posterior  distribution 
obtained  by  the  combination  of  a  prior  distribution  with  the  likelihood. 

Now it is not surprising that a scientific discipline should have rival theories. This
is a common phenomenon and the resolution of such rivalries is the stuff of scientific
progress. But in other subjects controversies are resolved within a decent interval of time.
What surely is odd is that rival theories in Statistics which have been available for more
than 200 years should still be in contention.

What I believe is that both sampling and Bayes theory have important roles in the
scientific iteration, but these roles are different. Sampling theory is needed for criticism
of an entertained model in the light of current data while Bayes theory is needed for making
inferences about parameters conditional on the adequacy of the entertained model. On this
view (see also Box and Tiao, 1973) both processes would have essential roles in the continuing
scientific iteration just as the two sexes are required for human reproduction. It is easy
to see that any attempt to choose between two entities which are not alternative but
complementary could certainly be expected to lead to contention, paradox, and confusion of the
kind we have been experiencing. The view that more than one mode of statistical reasoning can
be useful is not, of course, new and in particular was advanced (however with a different
emphasis and conclusions) by R. A. Fisher.


*There are other minor contenders but taking a broad view these can be regarded as schisms
from the two major philosophies. Thus Savage's description of fiducial theory as "a
valiant attempt at making the Bayesian omelette without breaking the Bayesian eggs" seems
justified. Certainly fiducial inference and likelihood inference are concerned with the
Bayesian objective of making some direct statement as to the plausibility of different
values of a parameter. Also many supporters of sampling theory would not necessarily go
along, for example, with all of Neyman-Pearson theory.

3. Some remarks on Sampling and Bayes inference


The essence of what I mean by "sampling theory inference" is exemplified by the Shewhart
quality control chart. The set of limit lines for the sample mean, for example, indicates
for this function of the data a reference distribution appropriate for the model $M_0$ (for
the process in control). A single outlying point is surprising because it is associated with
unusually low probability density. It thus raises the possibility that $M_0$ is inappropriate
and that an alternative model might be needed to explain the inadequacy. In the words of
Shewhart, the process is out of control in a manner which we may be able to attribute to
an assignable cause. A number of different functions of the data may be considered in
checking the appropriateness of $M_0$ and their nature depends on the type of departures from
$M_0$ that are in mind. Thus quality control charts are often kept of both the sample mean
and the sample range to indicate departures from $M_0$ in both level and spread, and other
functions such as run length of positive deviations might also be considered. Finally
patterns which were not foreseen may possibly turn up, invite consideration, and induce
possible explanations to be subsequently tested.
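
As a minimal sketch of this kind of reference-distribution check, the Python below computes
limit lines for the sample mean under an in-control model M_0 and flags subgroups whose mean
falls outside them. The in-control mean and standard deviation, the three-sigma multiplier,
the data, and the function names are all illustrative assumptions, not taken from the text.

```python
import math

def xbar_limits(mu0, sigma, n, k=3.0):
    """Limit lines for the mean of n observations under the in-control model M_0.

    mu0 and sigma are the assumed in-control mean and standard deviation;
    k = 3 is the conventional multiplier (an assumption of this sketch).
    """
    half_width = k * sigma / math.sqrt(n)
    return mu0 - half_width, mu0 + half_width

def out_of_control(subgroups, mu0, sigma, k=3.0):
    """Return the indices of subgroups whose sample mean lies outside the limits."""
    flagged = []
    for idx, group in enumerate(subgroups):
        lower, upper = xbar_limits(mu0, sigma, len(group), k)
        mean = sum(group) / len(group)
        if mean < lower or mean > upper:
            flagged.append(idx)
    return flagged

# Hypothetical data: four subgroups of size 4, the third shifted upward.
subgroups = [
    [10.1, 9.8, 10.0, 10.2],
    [9.9, 10.0, 10.1, 9.7],
    [11.9, 12.2, 12.0, 12.1],
    [10.0, 10.3, 9.8, 9.9],
]
print(out_of_control(subgroups, mu0=10.0, sigma=0.5))
```

A chart for the sample range, or for run lengths, would follow the same pattern with a
different function of the data and its own reference distribution under M_0.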

Prior  probabilities  in  Bayesian  and  Sampling  inference 
In the past the need for prior probabilities has often not been thought of as a
necessity for all scientific inference but rather as a feature peculiar to Bayesian inference.
Indeed it is often regarded by non-Bayesians as the major point of weakness of Bayes theory
and has, therefore, been a focus for attack and sometimes for derision. By contrast a
Bayesian proponent might argue (a) that any theory of estimation worthy of the name should
make it possible, given a model, to say after data had come to hand what was believed
about the values of its parameters, (b) that what was believed after the data was
available must surely depend on what was believed before it was available, and (c) that this
would include the possibility of sometimes using non-informative prior distributions either
to simulate the actual state of relative ignorance of the investigator or to represent the
impact of the data on a hypothetical unbiased observer (or juror). He might argue further
that the difficulties and paradoxes that have embarrassed advocates of sampling theory as it
has been practiced, and their inability to fix up the theory convincingly, have come from
its past inadequate capability to include prior information.

Sampling theory is of course not free from assumptions of prior knowledge. Instead it
is as if only two states of mind have been allowed: complete certainty or complete
uncertainty. Whereas in the sampling theory context a parameter had to be treated either as
exactly known or as completely unknown, in the Bayesian context a prior could be chosen to
approach either of these extremes or any intermediate state.

In this connection it is important to remember that every simple model can be thought
of as embedded in a more complex one. For example an outright assumption of normality can
be modelled by a suitable parametric family of distributions indexed by a parameter $\beta$,
which has a sharp prior at the normal value. Independence of errors, so frequently assumed,
can similarly be represented by a sharp prior operating on a broader model allowing
appropriate dependence. Seen in this way, it appears that, when assumptions of normality and
independence are made in sampling theory, it is not that no prior knowledge is used, but
rather that implausibly precise prior knowledge is implied.

4. The model is the prior


Such considerations lead me to believe that it is impossible to logically distinguish
between the model and the prior distribution. In a real sense the model is the prior. A
model is a probability statement of all the assumptions currently to be tentatively
entertained a priori. These probability statements can express certainty or various degrees
of uncertainty.

Of course models are approximations (good ones are artfully chosen approximations which
work well in practice). But there is good reason to believe that the "all or none" prior
assumptions implied in the traditional use of sampling theory are frequently too crude
even as an approximation. Indeed many of the difficulties of sampling theory which have
come to light in recent years may be traced to the primitive means it has available for
incorporating prior knowledge and the crippling effect of allowing only probability
statements of a certain kind to be included in the model. One illustration of how implied
prior knowledge which is implausibly imprecise can lead to trouble in sampling theory is the
famous discovery by Stein (1956) of the inadmissibility of the usual estimator of the normal
multivariate mean, and the improved nonlinear shrinkage estimators which give smaller mean
square error.

It is however easy to miss the lesson which is to be learned from such examples. To
be specific, consider the usual one-way analysis of variance set-up. Here a locally uniform
prior distribution for the set of group means $\mu' = (\mu_1, \mu_2, \ldots, \mu_n)$ which
would exactly justify the sample averages as estimators makes little sense (see, for example,
Box and Tiao (1968), Lindley and Smith (1972)). By contrast the prior assumption which
justifies the shrinkage estimator is that the $\mu_j$ are random samples from some normal super
population having unknown mean and variance. This corresponds to the usual "model II"
sampling theory assumption and in appropriate circumstances could be eminently reasonable.
It is crucial to notice, however, that there are many circumstances in which this latter
assumption would not be sensible either, because, although prior knowledge about
$\mu_1, \mu_2, \ldots, \mu_n$ existed, it was of quite a different character. For example, if
the $\mu$'s were daily batch yields from some production process, it would usually be much more
sensible to postulate that they followed some time series model such as a stationary
autoregressive process (Tiao and Ali (1971)). The estimators then derived from Bayesian means
are not Stein's shrinkage estimators, which would appropriately introduce sample information
about the super population variance $\sigma_\mu^2$, but alternative estimators allowing
incorporation of relevant sample information about the autocorrelation of the batch means.

Some sampling theorists concede that Bayes theorem may be used as a kind of conjuring
trick to produce efficient estimators which are then used in a sampling theory context. In
this exercise they regard the prior distribution as a convenient prop which is never taken
seriously and is quickly discarded. I think the example quoted above is one of many which
show that this idea has no rational status. For it illustrates that there is not one set
of "shrinkage estimators" but an infinity of such sets depending (very naturally) on the
model (that is the prior) which is appropriate to describe the particular scientific
situation under study.

The strength of the explicit statement of prior assumptions is that, in the iterative
model building process, they make manifest at every stage exactly what assumptions are
tentatively entertained and so allow them to be criticized. Some of the nervousness
experienced by non-Bayesians confronted with the idea of a prior distribution has perhaps
arisen because the iterative nature of scientific process, and the consequent tentative
transitory character of models and all their assumptions, has not been generally understood.
Many of us were taught to think unrealistically in terms of "one shot" procedures.
The sequence: frame hypothesis - collect data - test hypothesis/make decision, of
course, fails to describe the usual context in which Statistics is applied.

Critics have therefore feared gross mistakes arising from adamantine prior prejudice
which ignored "what the data were trying to say." In the iterative context of real
scientific enquiry however gross mistakes about the prior or any other aspect of the model
will usually be corrected at the criticism phase of the next iteration.

5. Two complementary factors from Bayes formula

If we accept the prior probability distribution of parameters $\theta$ as an essential part
of a model then all aspects of the model, hypothesized at some particular stage of an
investigation, are contained in the joint density obtained by combining the likelihood and
the prior

$$p(y, \theta \mid M) = p(y \mid \theta, M)\, p(\theta \mid M) \tag{5.1}$$

where $\mid M$ is understood to indicate conditionality on some aspect of the model and $y$ is
a data vector.

This joint distribution, which is a comprehensive statement of the model, can also be
factored as

$$p(y, \theta \mid M) = p(\theta \mid y, M)\, p(y \mid M) \tag{5.2}$$

and can be computed before any data become available. In particular the second factor on
the right

$$p(y \mid M) = \int p(y \mid \theta, M)\, p(\theta \mid M)\, d\theta , \tag{5.3}$$

which is the predictive distribution, may be so calculated. It is the distribution of the
totality of all possible samples that could occur if the model $M$ were true.

When an actual data vector $y_d$ becomes available

$$p(y_d, \theta \mid M) = p(\theta \mid y_d, M)\, p(y_d \mid M). \tag{5.4}$$

The first factor on the right is then Bayes' posterior distribution of $\theta$ given $y_d$

$$p(\theta \mid y_d, M) = k\, p(y_d \mid \theta, M)\, p(\theta \mid M) \tag{5.5}$$

and the second factor

$$p(y_d \mid M) = \int p(y_d \mid \theta, M)\, p(\theta \mid M)\, d\theta = k^{-1} \tag{5.6}$$

is the predictive density associated with the data set $y_d$ actually obtained. Figure 2
illustrates for a single parameter $\theta$ and a sample $y_d$ of $n = 2$ observations.
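
The factorization in (5.4) to (5.6) can be checked numerically. The sketch below is only an
illustration under assumed inputs: it takes a single parameter with a normal likelihood and a
normal prior, evaluates the joint density on a grid of parameter values, integrates it by the
trapezoidal rule to obtain the predictive density p(y_d | M) = k^{-1}, and divides to obtain
the posterior. The model, the data values, and all function names are hypothetical choices
made for the example.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and standard deviation."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def factor_joint(y_d, theta0, sd_theta, sd, grid_size=4001, width=10.0):
    """Split p(y_d, theta | M) into posterior and predictive factors on a grid.

    Returns (grid, posterior values on the grid, predictive density p(y_d | M)).
    """
    lo = theta0 - width * sd_theta
    step = 2.0 * width * sd_theta / (grid_size - 1)
    grid = [lo + i * step for i in range(grid_size)]
    # Joint density p(y_d | theta, M) p(theta | M) at each grid point, as in (5.1).
    joint = []
    for theta in grid:
        like = 1.0
        for y in y_d:
            like *= normal_pdf(y, theta, sd)
        joint.append(like * normal_pdf(theta, theta0, sd_theta))
    # Trapezoidal rule for the integral over theta: this is k^{-1} in (5.6).
    predictive = step * (sum(joint) - 0.5 * (joint[0] + joint[-1]))
    # Posterior (5.5): joint density rescaled by k.
    posterior = [j / predictive for j in joint]
    return grid, posterior, predictive

# Hypothetical sample of n = 2 observations.
grid, posterior, predictive = factor_joint([1.2, 0.4], theta0=0.0, sd_theta=1.0, sd=1.0)
step = grid[1] - grid[0]
area = step * (sum(posterior) - 0.5 * (posterior[0] + posterior[-1]))
print(predictive, area)
```

The posterior integrates to one whatever the data, which is why it cannot by itself reveal a
poor model; the predictive value is the quantity that can be compared with what the model
would lead one to expect.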

If the model is to be believed, then the posterior distribution $p(\theta \mid y_d, M)$ allows
all relevant estimation inferences to be made about $\theta$. However even if the model were
totally incorrect, this could not be shown by any abnormality in this factor, which is
conditional on both data and model specification. But the plausibility or otherwise of
obtaining such a sample if the model were appropriate may be assessed by reference of the
density $p(y_d \mid M)$ to the predictive reference distribution $p(y \mid M)$. An unusually
small value of $p(y_d \mid M)$, as measured by $\Pr\{p(y \mid M) < p(y_d \mid M)\}$, casts doubt
on the appropriateness of the model $M$. Now $p(y \mid M)$ is an $n$-dimensional distribution
and it will usually be true that if the model is inadequate it is most likely to be deficient
in certain directions associated with unusual values of certain specific functions $g_i(y)$ of
the data. Examples of such functions are sample averages, variances, moment coefficients,
coefficients of serial correlation, and other measures of standardized deviations from a norm.
In every case the appropriate reference distribution to which the realized statistic
$g_i(y_d)$ should be referred is the distribution $p\{g_i(y) \mid M\}$, when the model $M$ is
true, derived by appropriate integration of $p(y \mid M)$.

In practice, criticism or diagnostic checking of the model is often conducted by visual
inspection of residual displays and other more sophisticated plots. But such a process,
although it is informal, still, it seems to me, falls within the logical framework described
above. The statistician is looking for "features" in the data which would be surprising or
"unusual" if the model $M$ were true. Such a feature can be described by a function $g(y_d)$
and its unusualness, if formalized, would have to be measured by reference to
$p\{g(y) \mid M\}$.

In addition to possible discrepancies to which we have been alerted by experience,
other features may appear pointing to inadequacies of a kind not previously suspected. This
possibility has sometimes proved perplexing for statisticians, for while on the one hand the
truly unexpected could point the way to precious new knowledge, on the other, associated
probabilities will be indeterminate because of the uncountable character of other features
that might also have been regarded as surprising. I think the calculation which ignores
this difficulty of indeterminate selection should still be made, for while it might lead to
the too frequent pursuit of nonexistent assignable causes, the iterative process will
quickly terminate this chase and carrying out the exercise will at least eliminate phenomena
which at first sight look surprising but really are not. For example, Feller (1968) shows
that for a random group of 30 people, the probability that at least two have coincident
birthdays is over 70%. This tells us we need look no further for an explanation when we are
surprised to find two such people at a party.
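
The birthday figure is easy to reproduce. The short Python sketch below assumes 365 equally
likely and independent birthdays (the usual simplification, stated here as an assumption) and
multiplies out the probability that all birthdays are distinct; the function name is
illustrative.

```python
def prob_shared_birthday(people, days=365):
    """Probability that at least two of `people` share a birthday,
    assuming birthdays are independent and uniform over `days` days."""
    p_all_distinct = 1.0
    for i in range(people):
        p_all_distinct *= (days - i) / days
    return 1.0 - p_all_distinct

print(round(prob_shared_birthday(30), 3))
```

For a group of 30 the result is a little above 0.70, in agreement with the figure quoted from
Feller.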
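
Example: Unknown mean $\theta$, variance assumed known.

Consider a sample of $n$ observations drawn randomly from a normal distribution
with unknown mean $\theta$ and known variance $\sigma^2$. We express uncertainty about the mean
by supposing that a priori $\theta$ is distributed normally about $\theta_0$ with variance
$\sigma_\theta^2$, so that

$$p(y \mid \theta, M) = (2\pi)^{-n/2}\, \sigma^{-n} \exp\left\{ -\frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - \theta)^2 \right\} \tag{5.7}$$

$$p(\theta \mid M) = (2\pi)^{-1/2}\, \sigma_\theta^{-1} \exp\left\{ -\frac{(\theta - \theta_0)^2}{2\sigma_\theta^2} \right\} \tag{5.8}$$

The posterior distribution from which $\theta$ may be estimated conditional on the adequacy
of the model is then

$$p(\theta \mid y, M) = (2\pi)^{-1/2} \left( \frac{1}{\sigma_\theta^2} + \frac{n}{\sigma^2} \right)^{1/2} \exp\left\{ -\frac{1}{2} \left( \frac{1}{\sigma_\theta^2} + \frac{n}{\sigma^2} \right) (\theta - \hat{\theta})^2 \right\} \tag{5.9}$$

where

$$\hat{\theta} = \left( \frac{1}{\sigma_\theta^2} + \frac{n}{\sigma^2} \right)^{-1} \left( \frac{\theta_0}{\sigma_\theta^2} + \frac{n \bar{y}}{\sigma^2} \right).$$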

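The predictive distribution which can act as reference distribution for the observed
data vector $y_d$, thus allowing criticism of the model, is

$$p(y \mid M) = (2\pi)^{-n/2}\, \sigma^{-(n-1)}\, n^{-1/2} \left( \sigma_\theta^2 + \frac{\sigma^2}{n} \right)^{-1/2} \exp\left\{ -\frac{1}{2} \left[ \frac{(n-1)s^2}{\sigma^2} + \frac{(\bar{y} - \theta_0)^2}{\sigma_\theta^2 + \sigma^2/n} \right] \right\} \tag{5.10}$$

And the probability

$$P = \Pr\{p(y \mid M) < p(y_d \mid M)\} = \Pr\left\{ \chi_n^2 > \frac{(n-1)s_d^2}{\sigma^2} + \frac{(\bar{y}_d - \theta_0)^2}{\sigma_\theta^2 + \sigma^2/n} \right\}$$

supplies an overall portmanteau check on model fit.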
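
A minimal sketch of the portmanteau check follows. It computes the quadratic form appearing
in the exponent of (5.10) for an observed sample and refers it to a chi-square distribution
with n degrees of freedom. The function names, the example data, and the series used for the
chi-square tail (a standard expansion of the incomplete gamma function, chosen so the sketch
needs only the standard library) are assumptions of this example. The arguments `theta0`,
`var_theta` and `var` stand for the prior mean, the prior variance, and the known sampling
variance.

```python
import math

def chi2_sf(x, df):
    """Upper tail Pr{chi-square_df > x}, via the series for the regularized
    lower incomplete gamma function (adequate for moderate x)."""
    if x <= 0.0:
        return 1.0
    a, half_x = 0.5 * df, 0.5 * x
    term = 1.0 / a
    total = term
    k = 0
    while abs(term) > 1e-16 * abs(total):
        k += 1
        term *= half_x / (a + k)
        total += term
    lower = total * math.exp(-half_x + a * math.log(half_x) - math.lgamma(a))
    return max(0.0, 1.0 - lower)

def portmanteau_check(y, theta0, var_theta, var):
    """Return (g, P): the quadratic form from (5.10) and Pr{chi-square_n > g}."""
    n = len(y)
    ybar = sum(y) / n
    s2 = sum((yi - ybar) ** 2 for yi in y) / (n - 1)
    g = (n - 1) * s2 / var + (ybar - theta0) ** 2 / (var_theta + var / n)
    return g, chi2_sf(g, n)

# Hypothetical sample of two observations with unit variances.
g, p = portmanteau_check([1.2, 0.4], theta0=0.0, var_theta=1.0, var=1.0)
print(g, p)
```

A small value of P would indicate that the observed sample has unusually low predictive
density under the model.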

Obvious sample functions for checking individual features of the model are $\bar{y}$, $s^2$
and suitably chosen functions of standardized residuals $r = (r_1, \ldots, r_n)'$ with
$r_i = (y_i - \bar{y})/s$, $i = 1, \ldots, n$. The choice of these residual functions
$g_1, g_2, \ldots, g_k$ will depend on the context. They will include the standardized
residuals $r$ themselves, but might also address the need to apply checks for "bad values",
skewness, kurtosis and serial correlation, for example. The standardized residuals in the
form defined above are constrained by the identities $\sum r_i = 0$, $\sum r_i^2 = n - 1$ and
can be more conveniently parameterized in terms of $n - 2$ independently distributed functions
obtained as follows:

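Make an orthogonal transformation from $y$ to $Y = (Y_1, \ldots, Y_n)'$ with
$Y_n = \sqrt{n}\, \bar{y}$ and then transform to $\bar{y}$, $s^2$ and $u$, where $u$ is a vector
of $n - 2$ residual quantities $u = (u_1, u_2, \ldots, u_{n-2})$ such that

$$u_j = Y_{j+1} \Big/ \Big\{ j^{-1} \sum_{i=1}^{j} Y_i^2 \Big\}^{1/2}, \qquad j = 1, \ldots, n-2.$$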


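The Jacobian of the transformation from $y$ to $\bar{y}, s^2, u$ is proportional to

$$(s^2)^{\frac{n-1}{2} - 1} \prod_{j=1}^{n-2} \{1 + u_j^2/j\}^{-\frac{1}{2}(j+1)}.$$

After transformation therefore the predictive distribution contains $n$ elements all of
which are distributed independently and becomes

$$p(\bar{y}, s^2, u \mid M) = p(\bar{y} \mid M)\, p(s^2 \mid M)\, p(u \mid M) \tag{5.11}$$

where

$$p(\bar{y} \mid M) = (2\pi)^{-1/2} \left( \sigma_\theta^2 + \sigma^2/n \right)^{-1/2} \exp\left\{ -\tfrac{1}{2} (\bar{y} - \theta_0)^2 / (\sigma_\theta^2 + \sigma^2/n) \right\} \tag{5.12}$$

$$p(s^2 \mid M) = \left\{ \tfrac{1}{2}(n-1) \right\}^{\frac{1}{2}(n-1)} \Gamma^{-1}\left\{ \tfrac{1}{2}(n-1) \right\} (\sigma^2)^{-\frac{1}{2}(n-1)} (s^2)^{\frac{1}{2}(n-1) - 1} \exp\left\{ -\tfrac{1}{2}(n-1) s^2 / \sigma^2 \right\} \tag{5.13}$$

$$p(u \mid M) = \Gamma\left\{ \tfrac{1}{2}(n-1) \right\} \Gamma^{-1}\left( \tfrac{1}{2} \right) \pi^{-\frac{1}{2}(n-2)} \{(n-2)!\}^{-1/2} \prod_{j=1}^{n-2} \{1 + u_j^2/j\}^{-\frac{1}{2}(j+1)} \tag{5.14}$$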


The standardized residual quantities of interest $g_1, g_2, \ldots, g_k$ can then be expressed
equally as functions $f_1(u), f_2(u), \ldots, f_k(u)$ of the $u$'s. So that, in particular,
unusual features of $\bar{y}$, $s^2$, and $g_1, \ldots, g_k$ given the model could be assessed by
computing

(i) $\Pr\{p(\bar{y} \mid M) < p(\bar{y}_d \mid M)\}$

(ii) $\Pr\{p(s^2 \mid M) < p(s_d^2 \mid M)\}$

(iii) $\Pr\{p(g_j \mid M) < p(g_{jd} \mid M)\}$, $j = 1, 2, \ldots, k$.

These are the (two tail area) probabilities associated with reference of

(i) $(\bar{y}_d - \theta_0)/(\sigma_\theta^2 + \sigma^2/n)^{1/2}$ to the Normal table

(ii) $(n - 1)s_d^2/\sigma^2$ to the $\chi^2$ table

(iii) $g_{jd}$ to the reference distribution obtained by appropriate integration
of the distribution $p(u \mid M)$.

They yield checks on the adequacy of the model which we denote by $c(\bar{y})$, $c(s^2)$,
$c(g_j)$.
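
As an illustration of check (i) only, the sketch below standardizes the observed sample mean
by its predictive standard deviation and returns the two-tail normal probability. The function
name and the example values are assumptions made for this sketch; `theta0`, `var_theta` and
`var` again denote the prior mean, the prior variance, and the known sampling variance.

```python
import math

def check_mean(y, theta0, var_theta, var):
    """Check c(ybar): return (z, two-tail normal probability) for the sample mean.

    z is the observed mean minus the prior mean, divided by the predictive
    standard deviation sqrt(var_theta + var / n).
    """
    n = len(y)
    ybar = sum(y) / n
    z = (ybar - theta0) / math.sqrt(var_theta + var / n)
    # Two-tail area of the standard normal beyond |z|.
    two_tail = math.erfc(abs(z) / math.sqrt(2.0))
    return z, two_tail

# Hypothetical sample of two observations with unit variances.
z, prob = check_mean([1.2, 0.4], theta0=0.0, var_theta=1.0, var=1.0)
print(z, prob)
```

Checks (ii) and (iii) would follow the same pattern with the chi-square distribution and with
whatever reference distribution is derived for the chosen residual function.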

For example suppose the yield of a batch process was under study and that a
sample $y$ was available of $n$ observations all from a single batch having unknown
mean $\theta$. Suppose at this stage of the investigation that the tentative model assumed
that, because of process variation, batch means varied Normally and independently about
some value $\theta_0$ with variance $\sigma_\theta^2$ and, because of testing variation, the $i$th observation
$y_i$ varied about $\theta$ normally and independently with variance $\sigma_0^2$. Then the model would
be that discussed above and, if this model could be believed, the batch mean $\theta$ would be
estimated by the posterior distribution $N\{\hat{\theta}, (I_0 + I_{\bar{y}})^{-1}\}$ where $I_{\bar{y}} = n\sigma_0^{-2}$, $I_0 = \sigma_\theta^{-2}$.
And, if we write $w = I_{\bar{y}}/(I_0 + I_{\bar{y}})$ for the proportion of the information coming from
the sample, then $\hat{\theta} = w\bar{y} + (1 - w)\theta_0$.
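The weighting of sample and prior information just described can be sketched in a few lines. This is only an illustration under the stated Normal model; the function name and the numerical inputs are hypothetical.

```python
def posterior_for_batch_mean(ybar, n, theta0, var_theta, var0):
    """Posterior for the batch mean, conditional on the model.

    var_theta: variance of batch means about theta0 (process variation)
    var0:      variance of one observation about the batch mean (testing)
    """
    info_sample = n / var0          # information from the sample mean
    info_prior = 1.0 / var_theta    # information from the prior
    w = info_sample / (info_prior + info_sample)
    theta_hat = w * ybar + (1.0 - w) * theta0
    post_var = 1.0 / (info_prior + info_sample)
    return w, theta_hat, post_var

# Hypothetical inputs: equal information from sample and prior.
w, theta_hat, post_var = posterior_for_batch_mean(
    ybar=12.0, n=4, theta0=10.0, var_theta=1.0, var0=4.0)
print(w, theta_hat, post_var)
```

With equal information from both sources the posterior mean falls midway between the sample average and the prior value.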

Before drawing such a conclusion, however, a prudent statistician would question the
model. In particular, applying the checks $c(\bar{y})$, $c(s^2)$, $c(g_j)$,

(i) an unusually small value of $p(\bar{y} \mid M)$ could call into question the
choice of some or all of $\theta_0$, $\sigma_\theta^2$ and $\sigma_0^2$;

(ii) an unusually small value of $p(s^2 \mid M)$ could call into question the
choice of $\sigma_0^2$;

(iii) an unusually small value of $p(g_j \mid M)$ could suggest departures from
the assumed distributional form $p(y \mid \theta, M)$ produced by serial
correlation, bad data values, non-normality, etc.

Only after the investigator had found that the evidence offered by the data did not
invalidate the model should he proceed to make the conditional deductive inference
supplied by Bayes theorem.

6. Some implications

Consider the problem of making inferences about $\theta$ in the previous
example. If we assume the model true then we can estimate $\theta$ from a normal posterior
distribution with mean $\hat{\theta} = w\bar{y} + (1 - w)\theta_0$ and variance $(I_0 + I_{\bar{y}})^{-1}$, where
$w = I_{\bar{y}}/(I_0 + I_{\bar{y}})$ is the fraction of the information coming from the sample. First
however we require to check the model using the predictive distribution. In particular
the check $c(\bar{y})$ requires a reference of $(\bar{y}_d - \theta_0)/(\sigma_\theta^2 + \sigma_0^2/n)^{1/2}$ to the normal table.

Significance test. Suppose $\sigma_\theta^2$ is assumed small compared with $\sigma_0^2/n$; then $w$, the
relative amount of information supplied by the data, is small and $1 - w$ is close to
unity. Then, if this model can be relied upon, the posterior distribution is essentially
the same as the prior and is sharply centered at $\theta_0$. (A practical context is one where
the statistician is told that process variation is negligible compared with testing
variation and the process mean is known to be $\theta_0$.) If this model is assumed, then
information from available data $y$ can add very little to what is known already.

However, it can deny the relevance of this model. In particular $c(\bar{y})$ involves the
reference of $(\bar{y} - \theta_0)/(\sigma_0/\sqrt{n})$ to normal tables; the failure of this check means
that the model is discredited and therefore the operation that leads to a sharp
posterior distribution centered at $\theta_0$ may not logically be undertaken.
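The limiting argument can be checked numerically. The sketch below, with hypothetical names and numbers, shows the predictive check statistic approaching the familiar significance-test statistic as the variance of batch means shrinks relative to the testing variance divided by n.

```python
import math

def predictive_z(ybar, n, theta0, var_theta, var0):
    """Statistic used by the predictive check c(ybar)."""
    return (ybar - theta0) / math.sqrt(var_theta + var0 / n)

def significance_z(ybar, n, theta0, var0):
    """Usual test statistic for the hypothesis theta = theta0."""
    return (ybar - theta0) / math.sqrt(var0 / n)

def two_tail(z):
    """Two-tail Normal area beyond |z|."""
    return math.erfc(abs(z) / math.sqrt(2.0))

# Hypothetical data summary.
ybar, n, theta0, var0 = 10.9, 16, 10.0, 1.0
z_test = significance_z(ybar, n, theta0, var0)
for var_theta in (1e-2, 1e-4, 1e-6):
    z = predictive_z(ybar, n, theta0, var_theta, var0)
    print(var_theta, z, z_test, two_tail(z))
```

As `var_theta` decreases, the printed `z` converges to `z_test`, so the significance test appears as a special case of the predictive check.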

The above most satisfactorily explains to me the rationale of a significance test.

(i) The tentative model (null hypothesis) implies that $\theta \doteq \theta_0$.

(ii) A check on this aspect of the model is provided by reference of
$(\bar{y} - \theta_0)/(\sigma_0/\sqrt{n})$ to the Normal table.

(iii) If the tail area probability is not small we do not question the
model. The application of Bayes theorem then produces a posterior
distribution which is a delta function at $\theta_0$. We have "no reason
to question the null hypothesis".

(iv) If the tail area probability is small we conclude that the model
which postulates that $\theta \doteq \theta_0$ is discredited by the data and that
some other model is appropriate. The "null hypothesis is rejected."

(v) Notice too that although the failure of this check would most
immediately proscribe the use of Bayes theorem, the failure of
other checks (and of $c(s^2)$ in particular) would also indicate
the necessity of model modification before proceeding further.

A difficulty that this removes for me is that, as usually formulated, significance
tests seem to provide no basis for belief. On the above argument, if we accept the
model, we believe a priori that $\theta$ is close to $\theta_0$. We must therefore believe that
$\theta = \theta_0$ very nearly a posteriori. The availability of data provides however an
opportunity to assess the concordance of data and model.

The significance test itself provides a means only of discrediting the model.
Our belief in the proposition $\theta = \theta_0$ comes from an application of Bayes theorem for
a model which there is no reason to question (as a reasonable approximation to truth).

In particular this underscores the illogicality of testing a null hypothesis
which is not credible to begin with. Thus the Durbin-Watson test for serial
correlation, for which the null hypothesis is that errors are distributed independently, is
frequently misapplied to test serial data which a priori can be expected to be
autocorrelated.

Precise measurement and improper priors

Suppose now that $\sigma_\theta^2$ was very large compared with $\sigma_0^2/n$. The predictive check
$c(\bar{y})$ now approaches $(\bar{y}_d - \theta_0)/\sigma_\theta$, implying that for sets of data having widely
different sample averages the model would not be called into question. The situation
where such a non-informative prior distribution was relevant was referred to by L. J.
Savage as that where the theory of precise measurement applied. The invocation of
this principle might, at first, seem a license to use Bayes theorem without any
restraining checks of the model. But this idea makes no sense either from an applied or a
theoretical point of view.

The practical situation is that the sample information coming from $y$ must be
evaluated in a context where there is relatively very little prior information about
the value of $\theta$.

Here computational convenience and logic must of course be carefully distinguished.
Replacing "relatively very little" by zero can be justified computationally in those
circumstances where to do so provides a good numerical approximation but not otherwise.
However in either case zero remains infinitely smaller than any small quantity. In
this example, substitution of an improper uniform prior will produce a normal posterior
distribution having mean $\bar{y}$ and variance $\sigma_0^2/n$, also obtained as the limit when, in our
model, the fraction of information $w$ supplied by the data tends to unity. But not only
that, the specification of the prior for $\theta$ as $N(\theta_0, \sigma_\theta^2)$ is obviously overly specific,
and the improper prior could provide an appropriate limit for disperse priors which were
widely different in structure and/or much less specific.

All statistical results, in so far as they relate to reality, are approximations.
Those obtained from improper priors do in many important examples provide excellent
approximations. I hasten to add of course that limiting processes can be tricky and
theoretical statisticians are right to worry about them.

Notice however that the situation is different for the predictive check. To say
that $w$ is close to unity is only to say that $\sigma_\theta^2$ will dominate the denominator in
$(\bar{y}_d - \theta_0)/(\sigma_\theta^2 + \sigma_0^2/n)^{1/2}$. But to say that it is equal to unity implies that $\sigma_\theta^2$ is
infinite and the check cannot be made, which implies that there are absolutely no values
of $\bar{y}$ which could discredit the model, a situation which I cannot imagine as
practically possible.

Consider, for example, a physical chemist who runs experiments to determine the
activation energy $\theta$ for a particular chemical reaction about which little is known.
It would usually be true that his initial uncertainty about $\theta$ would be large
compared with the anticipated standard deviation $\sigma_0$ of the experimental procedure;
the theory of precise measurement would therefore apply and the limiting result
obtained from the usual improper prior would supply a good approximation. Nevertheless
the chemist may know that activation energies for compounds of the kind being tested are
usually measured in tens of kilocalories per gram mole. If the statistician,
who has perhaps misplaced a decimal point, presents him with an estimate of
say 0.1 kilocalories per gram mole he will rightly reject it. In doing so he will be
informally conducting a check formalized by $c(\bar{y})$. In practice then, checks such as
$c(\bar{y})$ can never really be dispensed with. The non-informative prior used in practice
must, to make practical sense, always be proper, but nevertheless the appropriate
posterior distribution can, in suitable circumstances, be numerically approximated by
the device of substituting an improper prior. I labour this point because
although it has been made earlier (see for example Box and Tiao 1973, p. 28) critics
seem to have misunderstood earlier discussions. Explicit consideration of predictive
checks makes the situation even clearer.
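The chemist's informal rejection can be mimicked numerically. In the sketch below the prior centre of 30 kilocalories per gram mole, the prior standard deviations, and the sample size are all hypothetical values chosen for illustration. It shows that a proper but disperse prior still flags an estimate of 0.1, whereas the tail area drifts towards one (no possible surprise) as the prior variance is pushed towards infinity.

```python
import math

def check_mean_tail_area(ybar, n, theta0, var_theta, var0):
    """Two-tail Normal area for the predictive check c(ybar)."""
    z = (ybar - theta0) / math.sqrt(var_theta + var0 / n)
    return math.erfc(abs(z) / math.sqrt(2.0))

# Hypothetical setting: estimate 0.1 where tens are expected.
prior_variances = (10.0 ** 2, 10.0 ** 4, 10.0 ** 8)
areas = [
    check_mean_tail_area(ybar=0.1, n=5, theta0=30.0, var_theta=v, var0=1.0)
    for v in prior_variances
]
for v, a in zip(prior_variances, areas):
    print(v, a)
```

Only the first, realistically proper, prior variance yields a small tail area; the near-improper choices cannot discredit any value of the sample average.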

Choosing  the  diagnostic  checks 

Frequently the checking functions $g(y)$ which are to be used formally or informally
for checking various features of a model $M$ are chosen on an ad hoc basis.

One formal basis for selection of such functions follows essentially the route
explored by Neyman and Pearson. Suppose a basic model $M_0$ is given and an alternative
model $M_1$ represents some discrepancy from $M_0$ which is of interest. Then a function
of the data suitable for detecting such discrepancies may be obtained from the ratio†

$p(y_d \mid M_0)/p(y_d \mid M_1)$
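A small numerical sketch may help to show why, as the accompanying footnote cautions, the size of this ratio alone cannot serve as model criticism. It assumes two Normal-mean models that differ only in the prior centre, so that only the sample-mean factor of the predictive density matters; all names and numbers are hypothetical.

```python
import math

def predictive_density(ybar, n, theta0, var_theta, var0):
    """Predictive density of the sample mean under a Normal-mean model."""
    v = var_theta + var0 / n
    return math.exp(-0.5 * (ybar - theta0) ** 2 / v) / math.sqrt(2.0 * math.pi * v)

def tail_area(ybar, n, theta0, var_theta, var0):
    """Two-tail Normal area for the predictive check c(ybar)."""
    z = (ybar - theta0) / math.sqrt(var_theta + var0 / n)
    return math.erfc(abs(z) / math.sqrt(2.0))

# Hypothetical case: M0 centred at 10, M1 centred at 12, observed mean 2.
ybar_d, n, var_theta, var0 = 2.0, 4, 1.0, 4.0
ratio = (predictive_density(ybar_d, n, 10.0, var_theta, var0)
         / predictive_density(ybar_d, n, 12.0, var_theta, var0))
check_m0 = tail_area(ybar_d, n, 10.0, var_theta, var0)
print(ratio, check_m0)
```

Here the ratio strongly favours the first model, yet the predictive check shows that model to be highly implausible as well.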


Parsimony;  Diagnostic  checks  versus  Robustification 

A question which confronts* the statistician at every stage of an investigation is
"How complex a model should I use?" The possibilities for model elaboration are of
course limitless. For instance a commonly used model assumes errors to be Independently,
Identically and Normally distributed (IIN). It is easy to imagine a sequence of
fall-back models which might begin like this

$M_0 \rightarrow M_1 \rightarrow M_2 \rightarrow M_3 \rightarrow \cdots$

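$\mathrm{I\,I\,N} \qquad \mathrm{I\,I}\,\not\mathrm{N} \qquad \not\mathrm{I}\,\mathrm{I}\,\not\mathrm{N} \qquad \not\mathrm{I}\,\not\mathrm{I}\,\not\mathrm{N}$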


†Model criticism cannot logically be conducted by the study of the magnitude of such
ratios however, for even if this ratio were very high the predictive check could
still show the favored model to be highly implausible.

*An apparently different question is "Should I use a robust procedure?", but I will
argue that this is subsumed by the broader question.

At each stage of elaboration there are many forms the modified model could take and
most require additional parameter values either to be given from prior knowledge or to
be estimated from the data. Obviously compromise is necessary; for, on the one hand,
simpler models can allow better scientific understanding and better estimation, while,
on the other hand, more complex ones can, but need not, be closer to the truth. A
consolation is that, realistically, model building is iterative, so that mistakes can be
rectified.

This fact of necessary compromise raises the dilemma of where the compromise should
be made, that is to say, of what should be left out and what be included. In particular
suppose some deviation from an "ideal" model $M_0$ can be parametrised by a discrepancy
parameter $\beta$ or a vector of such parameters.

For illustration $M_0$ might be the usual normal model and $\beta$ could measure

(i) possible serial correlation of errors $(e_{t-1}, e_t, e_{t+1}, \ldots)$; for
instance, the serial correlation might be generated by a first order
autoregressive process $e_t = \beta e_{t-1} + a_t$ where $a_t$ was a source of
discrete white noise.

(ii) possible deviation from error normality; for example* according to
$p(e \mid \sigma, \beta) = \mathrm{const}\; \sigma^{-1} \exp[-\tfrac{1}{2}\{e^2/\sigma^2\}^{1/(1+\beta)}]$

(iii) need for parametric transformation; for example the normal linear
model would be valid not for $y$ but for $y^\beta$.

(iv) need to allow for bad values; for example with probability $\beta$
(close to unity) the error variance was $\sigma^2$, with probability
$1 - \beta$ it was $k^2\sigma^2$.

In each case there are two ways to handle the possible model discrepancy, depending on
whether the parameter $\beta$ is omitted from, or included in, the model. We call these
diagnostic checking and robustification.

*Here and elsewhere other functional forms might be found more appropriate. These
examples are intended only to illustrate essential principles; not, of course, to be
comprehensive.

Diagnostic checking. If the discrepancy parameter is omitted from the model then an
appropriate diagnostic check can be made. Formally this would be done by referring
some suitable function $g(y)$ of the data to a reference distribution derived from the
predictive distribution $p(y \mid M_0)$.
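A diagnostic check of this kind can be sketched by simulation. The example below takes the lag-1 autocorrelation of the deviations from the mean as the checking function and builds its reference distribution by drawing independent Normal samples. The choice of checking function, the simulation size and the test series are assumptions made for illustration, not prescriptions from the text.

```python
import random

def lag1_autocorrelation(y):
    """Checking function g(y): lag-1 autocorrelation about the mean.
    It is unchanged by shifts of location and changes of scale."""
    n = len(y)
    mean = sum(y) / n
    dev = [v - mean for v in y]
    num = sum(dev[t] * dev[t - 1] for t in range(1, n))
    return num / sum(d * d for d in dev)

def diagnostic_check(y, n_sim=2000, seed=0):
    """Two-tail area for g(y_d) against a simulated reference
    distribution under independent, identical Normal errors."""
    rng = random.Random(seed)
    g_obs = lag1_autocorrelation(y)
    n = len(y)
    exceed = 0
    for _ in range(n_sim):
        sim = [rng.gauss(0.0, 1.0) for _ in range(n)]
        if abs(lag1_autocorrelation(sim)) >= abs(g_obs):
            exceed += 1
    return g_obs, exceed / n_sim

# Hypothetical serial data from a first order autoregressive process.
rng = random.Random(1)
e, series = 0.0, []
for _ in range(60):
    e = 0.8 * e + rng.gauss(0.0, 1.0)
    series.append(e)
g_obs, tail = diagnostic_check(series)
print(g_obs, tail)
```

Because the checking function depends on the data only through standardized residuals, the simulated reference distribution does not depend on the unknown mean or variance.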

Robustification. If the discrepancy parameter is included then robust estimation* of
$\theta$ is provided by the posterior distribution

$p(\theta \mid y) = \int p(\theta \mid \beta, y)\, p(\beta \mid y)\, d\beta$    (6.1)

If we write

$p_u(\beta \mid y) = p(\beta \mid y)/p(\beta)$    (6.2)

then

$p(\theta \mid y) = \int p(\theta \mid \beta, y)\, p_u(\beta \mid y)\, p(\beta)\, d\beta$    (6.3)

In this last expression

(i) $p(\beta)$ can be chosen to represent approximately the probability
of occurrence of different values of $\beta$ in the real world

(ii) the function $p_u(\beta \mid y)$ is a pseudo-likelihood which reflects
information about $\beta$ supplied by the data

(iii) considered as a function of $\beta$, $p(\theta \mid \beta, y)$ reflects the sensitivity
of estimation to the choice of the discrepancy parameter.

The omission of the parameter $\beta$ is equivalent to setting it equal to the value
$\beta_0$ which it takes in the ideal model $M_0$. Table 1 shows some examples of diagnostic
checks and corresponding robust estimation methods. A fuller discussion is given
elsewhere (Box 1979).
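The averaging over the discrepancy parameter can be approximated on a grid. The sketch below uses a bad-values model in which each observation has variance sigma squared with probability beta and k squared times sigma squared otherwise. The flat grid prior on theta, the discrete prior on beta, the data and every name in the code are assumptions introduced for this illustration.

```python
import math

def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def likelihood(theta, beta, y, sigma, k):
    """Likelihood under the bad-values model for given theta and beta."""
    out = 1.0
    for v in y:
        out *= (beta * normal_pdf(v, theta, sigma)
                + (1.0 - beta) * normal_pdf(v, theta, k * sigma))
    return out

def robust_posterior(y, sigma, k, theta_grid, beta_prior):
    """Grid version of p(theta|y) = sum over beta of
    p(theta|beta,y) p(beta|y), with a flat prior on the theta grid."""
    cond, weight = {}, {}
    for beta, prior in beta_prior.items():
        lik = [likelihood(t, beta, y, sigma, k) for t in theta_grid]
        total = sum(lik)
        cond[beta] = [l / total for l in lik]   # p(theta | beta, y)
        weight[beta] = prior * total            # proportional to p(beta | y)
    norm = sum(weight.values())
    post_beta = {b: w / norm for b, w in weight.items()}
    post_theta = [sum(cond[b][i] * post_beta[b] for b in beta_prior)
                  for i in range(len(theta_grid))]
    return post_theta, post_beta

# Hypothetical data with one apparent bad value.
y = [9.8, 10.1, 10.0, 9.9, 10.2, 15.0]
theta_grid = [9.0 + 0.01 * i for i in range(701)]
beta_prior = {0.8: 0.2, 0.9: 0.3, 0.95: 0.3, 0.99: 0.2}
post_theta, post_beta = robust_posterior(y, 0.2, 10.0, theta_grid, beta_prior)
robust_mean = sum(t * p for t, p in zip(theta_grid, post_theta))
sample_mean = sum(y) / len(y)
print(robust_mean, sample_mean)
```

The posterior mean stays near the bulk of the observations while the ordinary sample average is pulled towards the outlying value, and the assumptions responsible for this behaviour are visible in the model rather than hidden in the estimator.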

Discussion. There may be Bayesians who would deny the need for diagnostic checks based
on sampling theory. They may feel that "they can do it all with Bayes". I do not
believe this position can be sustained because it implies either
(i) that they know what the model is in advance or
(ii) that they are prepared to make the model so comprehensive that nothing
could possibly be overlooked.
Both positions are grandiose and unrealistic and the second if attempted could lead
to unnecessarily complicated models which would impede scientific progress.

*Numerous authors (Huber, Tukey, Andrews, Hampel, etc.) have proposed ad hoc methods
of robust estimation relying on the empirical modification of classical estimation
procedures. It seems more logical to me to modify the model which is presumably at
fault rather than the method of estimation which is not. Furthermore this has the
advantage of clearly revealing the assumptions which are being made.

In  this  connection  it  must  be  realized  that  looking  at  residuals  is  essentially  a 
sampling  theory  procedure  and  is  an  acknowledgement  of  the  often  happy  fact  that  an 
experiment  might  reveal  more  than  was  bargained  for.  To  put  it  another  way,  every 
Bayesian  statement  is  conditional  and  somewhere  there  has  to  be  an  anchor. 

An  acceptance  of  my  theme  implies  of  course  that  what  is  tentatively  included  in 
a  model  is  a  matter  of  judgement.*  However  we  can  still  look  for  guidelines  for  model 
building  on  what  to  tentatively  include  (robustify  for)  and  what  to  tentatively  omit 
(and  later  check  for). 

Obviously the need for special features in the model depends on the context, e.g.:
(a) serial data (in particular most economic and business data) cannot reasonably be
expected to be represented by a model with uncorrelated errors; autocorrelation is
virtually certain (temporary changes in mean and variance are also very likely in
serial data), (b) data for which $y_{\max}/y_{\min}$ is large are likely to need transformation
before any simple model could apply, (c) most experimental data are liable to
occasional bad values. Elaborations which are primary candidates for robustification
(inclusion in the model) reflect features which might easily elude diagnostic checks
and could then invalidate subsequent analysis.

Although the ad hoc robustifiers seem to have given all their attention to
possible non-normality of (assumed independent) observations, an even greater source
of serious trouble is autocorrelation in serial data. See for example Coen, Gomme and
Kendall (1968), Box and Newbold (1971), Pallesen (1977), Box and Jenkins (1970).


*This idea that a statistician has to use scientific judgement is not a universally
popular one. The objectivity of statistics like that of science does not of course
mean that all statisticians (or scientists) even though capable of using the same set
of tools will do equally well when using them. Just as there are good lawyers and bad
lawyers, there are good statisticians and poor ones.


APPENDIX

Another Example: $\theta$ known, $\sigma^2$ unknown

Suppose now we have a known mean $\theta$ but unknown variance $\sigma^2$. Also suppose we
express uncertainty about the variance by assuming a priori that $\sigma^2$ is distributed
about $s_0^2$ in a scaled $\chi^{-2}$ distribution having $\nu_0$ degrees of freedom. This is
equivalent to assuming that a supposedly relevant estimate $s_0^2$ of $\sigma^2$ having $\nu_0$
degrees of freedom is available from past data and has been assessed against a
noninformative reference prior (i.e. prior to the first sample the distribution of $\log \sigma$
was flat in the neighborhood of the likelihood). Then for a prospective sample of
$n = \nu + 1$ observations


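$$p(y \mid \sigma^2, M) \propto (\sigma^2)^{-n/2} \exp\left\{-\tfrac{1}{2}\left[\nu s^2 + n(\bar{y} - \theta)^2\right]/\sigma^2\right\} \qquad (A.1)$$

$$p(\sigma^2 \mid M) \propto (s_0^2)^{\nu_0/2}\, (\sigma^2)^{-(\nu_0/2 + 1)} \exp\left\{-\tfrac{1}{2}\nu_0 s_0^2/\sigma^2\right\} \qquad (A.2)$$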


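The complete prospective statement about the model is thus

$$p(y, \sigma^2 \mid M) \propto (s_0^2)^{\nu_0/2}\, (\sigma^2)^{-\{(n + \nu_0)/2 + 1\}} \exp\left\{-\tfrac{1}{2}(\nu_0 + n)\tilde{\sigma}^2/\sigma^2\right\} \qquad (A.3)$$

where $\tilde{\sigma}^2 = \{n(\bar{y} - \theta)^2 + \nu s^2 + \nu_0 s_0^2\}/(n + \nu_0)$.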


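When actual data become available then, conditional on the acceptance of this
model, inferences about $\sigma^2$ must be made from the posterior distribution

$$p(\sigma^2 \mid y_d, M) \propto (\tilde{\sigma}_d^2)^{(n + \nu_0)/2}\, (\sigma^2)^{-\{(n + \nu_0)/2 + 1\}} \exp\left\{-\tfrac{1}{2}(\nu_0 + n)\tilde{\sigma}_d^2/\sigma^2\right\} \qquad (A.4)$$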
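The pooled scale entering this posterior is easy to compute. The sketch below takes the scale to be the quantity obtained by adding the sum of squares of the observations about the known mean to the prior sum of squares and dividing by n plus the prior degrees of freedom. The function name and the data are hypothetical.

```python
def posterior_scale(y, theta, s0_sq, nu0):
    """Degrees of freedom and scale of the posterior for the variance,
    given a known mean theta and a prior estimate s0_sq on nu0 d.f."""
    n = len(y)
    ybar = sum(y) / n
    nu = n - 1
    s_sq = sum((v - ybar) ** 2 for v in y) / nu
    # n (ybar - theta)^2 + nu s^2 equals the sum of (y_i - theta)^2.
    scale = (n * (ybar - theta) ** 2 + nu * s_sq + nu0 * s0_sq) / (n + nu0)
    return n + nu0, scale

# Hypothetical sample and prior estimate.
y = [9.0, 11.0, 10.0, 12.0]
dof, scale = posterior_scale(y, theta=10.0, s0_sq=2.0, nu0=6)
direct = (sum((v - 10.0) ** 2 for v in y) + 6 * 2.0) / (len(y) + 6)
print(dof, scale, direct)
```

The two routes to the scale agree, which confirms the decomposition of the sum of squares about the known mean.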


However rational acceptance of the relevance of this model for the situation in which
$y_d$ is generated requires that relevant aspects of $y_d$ are not surprising when assessed
against a reference distribution derived from the predictive distribution

$$p(y \mid M) \propto (s_0^2)^{\nu_0/2}\, (\tilde{\sigma}^2)^{-(n + \nu_0)/2} \qquad (A.5)$$


,~2,  2 
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to  a  t  table. 


(ii)  *n(y  -  01/s^ 

(iii)  to  the  reference  distribution  obtained  by  appropriate  integration 

of  p(u). 

Inferences  about  the  variance 


(a) Suppose $\nu_0 \to 0$.

This limit corresponds to the usual noninformative Jeffreys' prior. Again the values
of $\nu_0$ that could represent real situations could approach zero but not reach it.
Since in practice there could always be values of $s^2$ which would be surprising even
though $p(\sigma^2 \mid M)$ was disperse, this would correspond to the situation where a very small
value of $p(F_{\nu,\nu_0})$ was found even though $\nu_0$ was very small.

(b) Suppose $\nu_0$ is very large

Then $s_0^2$ and $s_p^2 = (\nu s^2 + \nu_0 s_0^2)/(\nu + \nu_0)$ are very precisely known and if we
believe the model the posterior distribution $p(\sigma^2 \mid y, M)$ is sharply concentrated about
$s_0^2$ and our belief a posteriori is the same as that a priori. However for $p(y \mid M)$ we
obtain


i")  -  ^ 

p//n  p/'n 


(A. 10) 


where  z  is  a  unit  normal  deviate  and 

2 


.(4  w  -  4-;  ■  4} 


(A. 11) 


So that it is only after applying the checks $c(\bar{y})$ and $c(s^2)$ as well as $c_j(u)$ that
we could logically use Bayes theorem.